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Abstract. This article is a self-contained introduction to the Metropolis- 
Hastings algorithm, this ubiquitous tool for producing dependent simula¬ 
tions from an arbitrary distribution. The document illustrates the principles 
of the methodology on simple examples with R codes and provides entries 
to the recent extensions of the method. 
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1. INTRODUCTION 

There are many reasons why computing an integral like 

3(h) = / h(x)dir(x), 

Jx 

where d7r is a probability measure, may prove intractable, from the shape of 
the domain X to the dimension of X (and x), to the complexity of one of the 
functions h or 7r. Standard numerical methods may be hindered by the same 
reasons. Similar difficulties (may) occur when attempting to find the extrema 
of 7T over the domain X. This is why the recourse to Monte Carlo methods 
may prove unavoidable: exploiting the probabilistic nature of ir and its weighting 
of the domain X is often the most natural and most efficient way to produce 
approximations to integrals connected with ir and to determine the regions of 
the domain X that are more heavily weighted by n. The Monte Carlo approach 
(Hammersley and Handscomb, 1964; Rubinstein, 1981) emerged with computers, 
at the end of WWII, as it relies on the ability of producing a large number of 
realisations of a random variable distributed according to a given distribution, 
taking advantage of the stabilisation of the empirical average predicted by the Law 
of Large Numbers. However, producing simulations from a specific distribution 
may prove near impossible or quite costly and therefore the (standard) Monte 
Carlo may also face intractable situations. 

An indirect approach to the simulation of complex distributions and in par¬ 
ticular to the curse of dimensionality met by regular Monte Carlo methods is to 
use a Markov chain associated with this target distribution, using Markov chain 
theory to validate the convergence of the chain to the distribution of interest and 
the stabilisation of empirical averages (Meyn and Tweedie, 1994). It is thus little 
surprise that Markov chain Monte Carlo (MCMC) methods have been used for 
almost as long as the original Monte Carlo techniques, even though their impact 
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on Statistics has not been truly felt until the very early 1990s. A comprehensive 
entry about the history of MCMC methods can be found in Robert and Casella 
( 2010 ). 

The paper 1 is organised as follows: in Section 2, we define and justify the 
Metropolis-Hastings algorithm, along historical notes about its origin. In Section 
3, we provide details on the implementation and calibration of the algorithm. 
A mixture example is processed in Section 4. Section 5 includes recent exten¬ 
sions of the standard Metropolis-Hastings algorithm, while Section 6 concludes 
about further directions for Markov chain Monte Carlo methods when faced with 
complex models and huge datasets. 

2. THE ALGORITHM 

2.1 Motivations 

Given a probability density n called 
the target , defined on a state space 
X, and computable up to a mul¬ 
tiplying constant, ir(x) oc 7f(x), 
the Metropolis-Hastings algorithm, 
named after Metropolis et al. (1953) 
and Hastings (1970), proposes a 
generic way to construct a Markov 
chain on X that is ergodic and sta¬ 
tionary with respect to it —meaning 
that, if A® ~ 7r(x), then ~ 

7 r(x)—and that therefore converges 
in distribution to ir. While there 
are other generic ways of delivering 
Markov chains associated with an ar¬ 
bitrary stationary distribution, see, 
e.g., Barker (1965), the Metropolis- 
Hastings algorithm is the workhorse of MCMC methods, both for its simplicity 
and its versatility, and hence the first solution to consider in intractable situa¬ 
tions. The main motivation for using Markov chains is that they provide shortcuts 
in cases where generic sampling requires too much effort from the experimenter. 
Rather than aiming at the “big picture” immediately, as an accept-reject algo¬ 
rithm would do (Robert and Casella, 2009), Markov chains construct a progres¬ 
sive picture of the target distribution, proceeding by local exploration of the state 
space X until all the regions of interest have been uncovered. An analogy for the 
method is the case of a visitor to a museum forced by a general blackout to watch 
a painting with a small torch. Due to the narrow beam of the torch, the person 
cannot get a global view of the painting but can proceed along this painting until 
all parts have been seen.- 2 

Before describing the algorithm itself, let us stress the probabilistic foundations 
of Markov chain Monte Carlo (MCMC) algorithms: the Markov chain returned 

1 I am most grateful to Alexander Ly, Department of Psychological Methods, University of 
Amsterdam, for pointing out mistakes in the R code of an earlier version of this paper. 

2 Obviously, this is only an analogy in that a painting is more than the sum of its parts! 
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Fig 1. Fit of the histogram of a Metropolis- 
Hastings sample to its target, for T = 10 4 it¬ 
erations, a scale a = 1, and a starting value 
x (1) = 3.14. 
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by the method, X( l \x( 2 \ ..., X^\ .. . is such that X^ is converging to 7 r. This 
means that the chain can be considered as a sample, albeit a dependent sam¬ 
ple, and approximately distributed from it. Due to the Markovian nature of the 
simulation, the first values are highly dependent on the starting value X^ and 
usually removed from the sample as burn-in or warm-up. While there are very few 
settings where the time when the chain reaches stationarity can be determined, 
see, e.g., Hobert and Robert (2004), there is no need to look for such an instant 
since the empirical average 

(1) j T (h) = ^ J>pr (t) ) 

t =1 


converges almost surely to 3(h), no matter what the starting value, if the Markov 
chain is ergodic, i.e., forgets about its starting value. This implies that, in theory, 
simulating a Markov chain is intrinsically equivalent to a standard i.i.d. simulation 
from the target, the difference being in a loss of efficiency, i.e., in the necessity to 
simulate more terms to achieve a given variance for the above Monte Carlo esti¬ 
mator. The foundational principle for MCMC algorithms is thus straightforward, 
even though the practical implementation of the method may prove delicate or 


in cases impossible. 

Without proceeding much further 
into Markov chain theory, we stress 
that the existence of a stationary dis¬ 
tribution for a chain implies this chain 
automatically enjoys a strong stability 
called irreducibility. Namely, the chain 
can move all over the state space, i.e., 
can eventually reach any region of the 
state space, no matter its initial value. 

2.2 The algorithm 

The Metropolis-Hastings algorithm 
associated with a target density 7r re¬ 
quires the choice of a conditional den¬ 
sity q also called proposal or candidate 
kernel. The transition from the value 
of the Markov chain (X^) at time t 
and its value at time t + 1 proceeds 
via the following transition step: 
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Fig 2. Fit of the histogram of a Metropolis- 
Hastings sample to its target, for T = 10 4 it¬ 
erations , a scale a = 0.1, and a starting value 
= 3.14. 


Algorithm 1. Metropolis-Hastings 
Given JW = x'W, 


1. Generate Y t rsj q(y\x (t) ). 

2. Take 
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Y t 

x {t) 


with probability p(x^,Yt), 
with probability 1 — p(x^\Yt), 
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where 


p(x, y) = min 


n(y) q(x\y) A 

n(x) q(y\x ) ’ J ' 


Then, as shown in Metropolis et al. 

(1953), this transition preserves the 
stationary density it if the chain is irre¬ 
ducible, that is, if q has a wide enough 
support to eventually reach any region 
of the state space X with positive mass 
under n. A sufficient condition is that 
q is positive everywhere. The very na¬ 
ture of accept-reject step introduced 
by those authors is therefore sufficient 
to turn a simulation from an almost ar¬ 
bitrary proposal density q into a gen¬ 
eration that preserves ir as the sta¬ 
tionary distribution. This sounds both 
amazing and too good to be true! 

But it is true, in the theoretical sense 
drafted above. In practice, the perfor¬ 
mances of the algorithm are obviously highly dependent on the choice of the 
transition q. since some choices see the chain unable to converge in a manageable 
time. 

2.3 An experiment with the algorithm 

To capture the mechanism behind the algorithm, let us consider an elementary 
example: 

Example 1. Our target density is a perturbed version of the normal A/"(0,1) 
density, tp(-), 

7t(x) = sin 2 (x) x sin 2 (2x) x <p(x). 

And our proposal is a uniform U(x — a, x + a) kernel, 

q(y\x) ,-y^-(x—a,x+a) (y) • 

Implementing this algorithm is straightforward: two functions to define are the 
target and the transition 


target=function(x){ 

sin(x) ''2*sin(2*x) ''2*dnorm(x) > 

metropolis=function(x,alpha=l){ 
y=runif(1,x-alpha,x+alpha) 
if (runif(1)>target(y)/target(x)) y=x 
return(y)} 
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Fig 3. Fit of the histogram of a Metropolis- 
Hastings sample to its target, for T = 10 5 it¬ 
erations, a scale a = 0.2, and a starting value 
x w = 3.14. 
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and all we need is a starting value 


T=10~4 

x=rep(3.14,T) 

for (t in 2:T) x [t] metropolis (x [t-1]) 


which results in the histogram of Figure 1, where the target density is properly 
normalised by a numerical integration. If we look at the sequence (aW) returned 
by the algorithm, it changes values around 5000 times. This means that one 
proposal out of two is rejected. If we now change the scale of the uniform to 
a = 0.1, the chain ( x takes more than 9000 different values, however the 
histogram in Figure 2 shows a poor fit to the target in that only one mode is 
properly explored. The proposal lacks the power to move the chain far enough to 
reach the other parts of the support of vr(-). A similar behaviour occurs when we 
start at 0. A last illustration of the possible drawbacks of using this algorithm is 
shown on Figure 3: when using the scale a = 0.2 the chain is slow in exploring the 
support, hence does not reproduce the correct shape of it after T = 10 5 iterations. 
◄ 


From this example, we learned that some choices of proposal kernels work well 
to recover the shape of the target density, while others are poorer, and may even 
fail altogether to converge. Details about the implementation of the algorithm 
and the calibration of the proposal q are detailed in Section 3. 

2.4 Historical interlude 

The initial geographical localisation of the MCMC algorithms is the nuclear 
research laboratory in Los Alamos, New Mexico, which work on the hydrogen 
bomb eventually led to the derivation Metropolis algorithm in the early 1950s. 
What can be reasonably seen as the first MCMC algorithm is indeed the Metropo¬ 
lis algorithm, published by Metropolis et al. (1953). Those algorithms are thus 
contemporary with the standard Monte Carlo method, developped by Ulam and 
von Neumann in the late 1940s. (Nicolas Metropolis is also credited with sug¬ 
gesting the name “Monte Carlo“, see Eckhardt, 1987, and published the very 
first Monte Carlo paper, see Metropolis and Ulam, 1949.) This Metropolis algo¬ 
rithm, while used in physics, was only generalized by Hastings (1970) and Peskun 
(1973, 1981) towards statistical applications, as a method apt to overcome the 
curse of dimensionality penalising regular Monte Carlo methods. Even those later 
generalisations and the work of Hammersley, Clifford, and Besag in the 1970’s 
did not truly impact the statistical community until Gernan and Genian (1984) 
experimented with the Gibbs sampler for image processing, Tanner and Wong 
(1987) created a form of Gibbs sampler for latent variable models and Gelfand 
and Smith (1990) extracted the quintessential aspects of Gibbs sampler to turn 
it into a universal principle and rekindle the appeal of the Metropolis-Hastings 
algorithm for Bayesian computation and beyond. 

3. IMPLEMENTATION DETAILS 

When working with a Metropolis-Hastings algorithm, the generic nature of 
Algorithm 1 is as much an hindrance as a blessing in that the principle remains 
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valid for almost every choice of the proposal q. It thus does not give indications 
about the calibration of this proposal. For instance, in Example 1, the method 
is valid for all choices of a but the comparison of the histogram of the outcome 
with the true density shows that a has a practical impact on the convergence 
of the algorithm and hence on the number of iterations it requires. Figure 4 
illustrates this divergence in performances via the autocorrelation graphs of three 
chains produced by the R code in Example 1 for three highly different values of 
a = 0.3, 3, 30. It shows why a = 3 should be prefered to the other two values 
in that each value of the Markov chain contains “more” information in that 
case. The fundamental difficulty when using the Metropolis-Hastings algorithm 
is in uncovering which calibration is appropriate without engaging into much 
experimentation or, in other words, in an as automated manner as possible. 

A (the?) generic version of the 
Metropolis-Hastings algorithm is the 
random walk Metropolis-Hastings al¬ 
gorithm, which exploits as little as pos¬ 
sible knowledge about the target dis¬ 
tribution, proceeding instead in a lo¬ 
cal if often myopic manner. To achieve 
this, the proposal distribution q aims 
at a local exploration of the neigh¬ 
borhood of the current value of the 
Markov chain, i.e., simulating the pro¬ 
posed value Y t as 

Y t = X® + e u 



Fig 4. Comparison of the acf of three Markov 
chains corresponding to scales a = 0.3, 3, 30, for 


T = 10’ 
3.14. 


iterations, and a starting value = 


where e* is a random perturbation with distribution g, for instance a uniform 
distribution as in Example 1 or a normal distribution. If we call random walk 
Metropolis-Hastings algorithms all the cases when g is symmetric, the acceptance 
probability in Algorithm 1 gets simplified into 


p(x, y ) = min 

While this probability is independent of the scale of the proposal g, we just saw 
that the performances of the algorithm are quite dependent on such quantities. In 
order to achieve an higher degree of efficiency, i.e., towards a decrease of the Monte 
Carlo variance, Roberts et al. (1997) studied a formal Gaussian setting aiming at 
the ideal acceptance rate. Indeed, Example 1 showed that acceptance rates that 
are either “too high” or “too low” slow down the convergence of the Markov chain. 
They then showed that the ideal variance in the proposal is twice the variance of 
the target or, equivalently, that the acceptance rate should be close to l /±. While 
this rule is only an indication (in the sense that it was primarily designed for a 
specific and asymptotic Gaussian environment), it provides a golden rule for the 
default calibration of random walk Metropolis-Hastings algorithms. 

3.1 Effective sample size 

We now consider the alternative of the effective sample size for comparing 
and calibrating MCMC algorithms. Even for a stationary Markov chain, using T 
iterations does not amount to simulating an iid sample from n that would lead to 













































7 


the same variability. Indeed, the empirical average (1) cannot be associated with 
the standard variance estimator 

T 

°T = T _ 1 Y [h(X^) - 3 T (h) T ) 

1 t=i 

due to the correlations amongst the X^’s. In this setting, the effective sample 
size is defined as the correction factor tt such that a ^/tt is the variance of the 
empirical average (1)). This quantity can be computed as in Geweke (1992) and 
Heidelberger and Welch (1983) by 

r T = T/n(h ), 

where n(h) is the autocorrelation associated with the sequence h(X^), 

OO 

K,(h) = 1 + 2 corr (ji{X^),h(X^)j , 
t =l 

estimated by spectrumO and effectiveSize from the R library coda, via the 
spectral density at zero. A rough alternative is to rely on subsampling, as in 
Robert and Casella (2009), so that X is approximately independent from 
The lag G is possibly determined in R via the autocorrelation function 

autocorr. 

Example 2. (Example 1 continued) We can compare the Markov chains 
obtained with a = 0.3, 3, 30 against those two criteria: 


> autocor(mcmc(x)) 

[,1] [> 2] [, 3] 

Lag 0 1.0000000 1.0000000 1.0000000 

Lag 1 0.9672805 0.9661440 0.9661440 

Lag 5 0.8809364 0.2383277 0.8396924 

Lag 10 0.8292220 0.0707092 0.7010028 
Lag 50 0.7037832 -0.033926 0.1223127 

> effectiveSize(x) 

[, 1 ] [, 2 ] [, 3 ] 

33.45704 1465.66551 172.17784 


This shows how much comparative improvement is brought by the value a = 3, 
but also that even this quasi-optimal case is far from an i.i.d. setting. ◄ 

3.2 In practice 

In practice, the above tools of ideal acceptance rate and of higher effective 
sample size give goals for calibrating Metropolis-Hastings algorithms. This means 
comparing a range of values of the parameters involved in the proposal and se¬ 
lecting the value that achieves the highest target for the adopted goal. For a 
multidimensional parameter, global maximisation run afoul of the curse of di¬ 
mensionality as exploring a grid of possible values quickly becomes impossible. 





8 


The solution to this difficulty stands in running partial optimisations, with sim¬ 
ulated (hence controlled) data, for instance setting all parameters but one fixed 
to the values used for the simulated data. If this is not possible, optimisation of 
the proposal parameters can be embedded in a Metropolis-within-Gibbs 3 since 
for each step several values of the corresponding parameter can be compared via 
the Metropolis-Hastings acceptance probability. 

We refer the reader to Chapter 8 
of Robert and Casella (2009) for more 
detailed descriptions of the calibra¬ 
tion of MCMC algorithms, including 
the use of adaptive mecchanisms. In¬ 
deed, calibration is normaly operated 
in a warm-up stage since, otherwise, 
if one continuously tune an MCMC al¬ 
gorithm according to its past outcome, 
the algorithm stops being Markovian. 

In order to preserve convergence in an 
adaptive MCMC algorithm, the solu¬ 
tion found in the literature for this 
difficulty is to progressively tone/tune 
down the adaptive aspect. For in- 

dition that states that the distance be¬ 
tween two consecutive Markov kernels must uniformly decrease to zero. For in¬ 
stance, a random walk proposal that relies on the empirical variance of the past 
sample as suggested in Haario et al. (1999) does satisfy this condition. An alter¬ 
native proposed by Roberts and Rosenthal (2009) proceeds by tuning the scale 
of a random walk for each component against the acceptance rate, which is the 
solution implemented in the amcmc package developed by Rosenthal (2007). 

4. ILLUSTRATION 

Kamary et al. (2014) consider the special case of a mixture of a Poisson and 
of a Geometric distributions with the same mean parameter A: 

aV( A) + (1 - a)Qeo(}/i +\), 

where A > 0 and 0 < a < 1. Given n observations (xi,... ,x n ) and a prior 
decomposed into 7 t(A) oc l /\ and tt (a) oc [a(l — a)] a °~ 1 , ao = 0.5 being the 
default value, the likelihood function is available in closed form as 

yj < a —- exp{—A} + (1 — a)X Xi (l + A) _Ii_1 
i=i L Xi ' 

where s n = x\ + ... + x n . In R code, this translates as 

3 The Metropolis-within-Gibbs algorithm aims at simulating a multidimensional distribution 
by successively simulating from some of the associated conditional distributions—this is the 
Gibbs part—and by using one Metropolis-Hastings step instead of an exact simulation scheme 
from this conditional, with the same validation as the original Gibbs sampler. 


stance, Roberts and Rosenthal (2009) 
propose a diminishing adaptation con- 
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Fig 5. Output of a two-dimensional ran¬ 
dom walk Metropolis-Hastings algorithm for 123 
observations from a Poisson distribution with 
mean 1, under the assumed model of a mixture 
between Poisson and Geometric distributions. 
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likelihood=function(x,lam,alp){ 

prod(alp*dpois(x,lam)+(1-alp)*dgeom(x,lam/(1+lam)))} 
posterior=function(x,lam,alp){ 

sum(log(alp*dpois(x,lam)+(1-alp)*dgeom(x,1/(1+lam))))- 
log(lam)+dbeta(alp,.5,.5,log=TRUE)} 


If we want to build a Metropolis-Hastings algorithm that simulates from the 
associated posterior, the proposal can proceed by either proposing a joint move on 
( a , A) or moving one parameter at a time in a Metropolis-within-Gibbs fashion. 
In the first case, we can imagine a random walk in two dimensions, 

a! ~ £(ea, e(l — a)) , X' ~ £A/"(log(A), 5(1 + log(A) 2 )), e,<5>0 

with an acceptance probability 

7r(c/, X'\x)q(a, A|c/, A') ^ 

7r (a, A| x)q(a, A|a', A') 

The Metropolis-Hastings R code would then be 


metropolis=function(x,lam,alp,eps=l,del=l){ 

prop=c(exp(rnorm(l,log(lam),sqrt(del*(l+log(lam)~2)))), 
rbeta(l,l+eps*alp,1+eps*(1-alp))) 
rat=posterior(x.prop[1],prop[2])-posterior(x,lam,alp)+ 

dbeta(alp,l+eps*prop[2],1+eps*(1-prop[2]),log=TRUE)- 
dbeta(prop[2],l+eps*alp,1+eps*(1-alp),log=TRUE)+ 
dnorm(logdam),log(prop [1]), 

sqrt(del*(l+log(prop[1])"2)),log=TRUE)- 
dnorm(log(prop[1]),log(lam), 

sqrt(del*(l+log(lam) ~2 )),log=TRUE)+ 
log(prop[1]/lam) 

if (log(runif(1))>rat) prop=c(lam,alp) 
return(prop)} 


where the ratio prop [1]/lam in the acceptance probability is just the Jacobian 
for the log-normal transform. Running the following R code 


T=le4 

x=rpois(123,lambda=l) 

para=matrix(c(mean(x),runif(1)),nrow=2,ncol=T) 

like=rep(0,T) 

for (t in 2:T){ 

para[,t]=metropolis(x,para[l,t-l],para[2,t—1],eps=.1,del=.1) 
like [t] =posterior(x.para[1,t],para[2,t])J 


then produced the histograms of Figure 5, after toying with the values of e and 
5 to achieve a large enough average acceptance probability, which is provided by 
length (unique (para[l,] ))/T The second version of the Metropolis-Hastings 









10 


algorithm we can test is to separately modify A by a random walk proposal, test 
whether or not it is acceptable, and repeat with a: the R code is then very similar 
to the above one: 


metropolis=function(x,1am,alp,eps=l,del=l){ 

prop=exp(rnorm(l,log(lam),sqrt(del*(l+log(lam)~2)))) 
rat=posterior(x.prop,alp)-posterior(x,lam,alp)+ 
dnorm(logdam),log(prop [1]) , 

sqrt(del*(1+log(prop[1])"2)),log=TRUE)- 
dnorm(log(prop[1]),log(lam), 

sqrt(del*(l+log(lam)~2)),log=TRUE)+ 
log(prop/lam) 

if (log(runif(1))>rat) prop=lam 
qrop=rbeta(l,l+eps*alp,1+eps*(1-alp)) 
rat=posterior(x.prop,qrop)-posterior(x,prop,alp)+ 

dbeta(alp,l+eps*qrop,1+eps*(1-qrop),log=TRUE)- 
dbeta(qrop,l+eps*alp,1+eps*(1-alp),log=TRUE) 
if (log(runif(1))>rat) qrop=alp 
return(c(prop,qrop))} 


In this special case, both algo¬ 
rithms thus return mostly equivalent 
outcomes, with a slightly more dis¬ 
persed output in the case of the 
Metropolis-within-Gibbs version (Fig¬ 
ure 7). In a more general perspective, 
calibrating random walks in multiple 
dimensions may prove unwieldly, es¬ 
pecially with large dimensions, while 
the Metropolis-within-Gibbs remains 
manageable. One drawback of the later 
is common to all Gibbs implementa¬ 
tions, namely that it induces higher 
correlations between the components, 
which means a slow convergence of the 
chain (and in extreme cases no conver¬ 
gence at all). 
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k a 

Fig 6. Output of a Metropolis-within-Gibbs ran¬ 
dom walk Metropolis-Hastings algorithm for 123 
observations from a Poisson distribution with 
mean 1, under the assumed model of a mixture 
between Poisson and Geometric distributions. 


5. EXTENSIONS 
5.1 Langevin algorithms 

An extension of the random walk Metropolis-Hastings algorithm is based on 
the Langevin diffusion solving 


dX t = y 2 V log ir(X t )dt + dB t , 


where Bt is the standard Brownian motion and V/ denotes the gradient of /, 
since this diffusion has n as its stationary and limiting distribution. The algorithm 
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is based on a discretised version of the above, namely 

Y n+ 1 \X n ~ J\f (x + h/ 2 V log 7T(z), fcV a I d ) , 

for a discretisation step h, which is used as a proposed value for X n+ \. and ac¬ 
cepted with the standard Metropolis-Hastings probability (Roberts and Tweedie, 
1995). This new proposal took the name of Metropolis adjusted Langevin al¬ 
gorithms (hence MALA). While computing (twice) the gradient of n at each 
iteration requires extra time, there is strong support for doing so, as MALA algo¬ 
rithms do provide noticeable speed-ups in convergence for most problems. Note 
that 7 r(-) only needs to be known up to a multiplicative constant because of the 
log transform. 

5.2 Particle MCMC 


Another extension of the Metropolis- 
Hastings algorithm is the particle 
MCMC (or pMCMC ), developed by 
Andrieu et al. (2011). While we can¬ 
not provide an introduction to par¬ 
ticle filters here, see, e.g., Del Moral 
et al. (2006), we want to point out 
the appeal of this approach in state 
space models like hidden Markov mod¬ 
els (HMM). This innovation is similar 
to the pseudo-marginal algorithm ap¬ 
proach of Beaumont (2003); Andrieu 
and Roberts (2009), taking advantage 
of the auxiliary variables exploited by 
particle filters. 

In the case of an HMM, i.e., where a 
latent Markov chain xq-t with density 

Po(xo\ 0 )pi(xi\x o , 6 ) ■ ■ ■ p T {x T \x T -\, 0 ), 
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Fig 7. Output of a Metropolis-within- Gibbs 
(blue) and of a two-dimensional (gold) ran¬ 
dom walk Metropolis-Hastings algorithm for 123 
observations from a Poisson distribution with 
mean 1, under the assumed model of a mixture 
between Poisson and Geometric distributions. 


is associated with an observed sequence yi+r such that 

T 

Vi+t\X\,t,0 ~ W_qi(yi\xi,Q ), 

Z— 1 

pMCMC applies as follows. At every iteration t, a value 9' of the parameter 
9 ~ f)(0|(?(d) is proposed, followed by a new value of the latent series x' 0 . T gen¬ 
erated from a particle filter approximation of p(xq : t\Q', Di-.t)- As the particle Li¬ 
ter produces in addition (Del Moral et al., 2006) an unbiased estimator of the 
marginal posterior of yi-.p, q(yi:T\ 9 '), this estimator can be directly included in 
the Metropolis-Hastings ratio 

g(yi:T\9'M9'M0U\9') 

qiyi-.Tm^M^) 

The validation of this substitution follows from the general argument of Andrieu 
and Roberts (2009) for pseudo-marginal techniques, even though additional ar¬ 
guments are required to establish that all random variables used therein are 
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accounted for (see Andrieu et al., 2011 and Wilkinson, 2011). We however stress 
that the general validation of those algorithm as converging to the joint poste¬ 
rior does not proceed from pseudo-marginal arguments. An extension of pMCMC 
called SMC 2 that approximates the sequential filtering distribution is proposed 
in Chopin et al. (2013). 

5.3 Pseudo-marginals 

As illustrated by the previous section, there are many settings where com¬ 
puting the target density 7r(-) is impossible. Another example is made of doubly 
intractable likelihoods (Murray et al., 2006a), when the likelihood function con¬ 
tains a term that is intractable, for instance t(6\x) oc g(x\9) with an intractable 
normalising constant 

3(9) = [ g(x\9)dx. 

J x 

This phenomenon is quite common in graphical models, as for instance for the 
Ising model (Murray et al., 2006b; Mpller et al., 2006). Solutions based on aux¬ 
iliary variables have been proposed (see, e.g., Murray et al., 2006a; Mpller et al., 
2006), but they may prove difficult to calibrate. 

In such settings, Andrieu and Roberts (2009) developped an approach based 
on an idea of Beaumont (2003), designing a valid Metropolis-Hastings algorithm 
that substitutes the intractable target vr(-|x) with an unbiased estimator. A slight 
change to the Metropolis-Hastings acceptance ratio ensures that the stationary 
density of the corresponding Markov chain is still equal to the target it. Indeed, 
provided n(9\z) is an unbiased estimator of n(9) when z ~ q('\9), it is rather 
straightforward to check that the acceptance ratio 

tt(9*\z*) q(9*, 0)q(z\6) 

Tt(9\z) q(6,9*)q(z*\6*) 

preserves stationarity with respect to an extended target (see Andrieu and Roberts 
(2009)) for details) when z* ~ q(-\9), and 6*\6 ~ q(9,9*). Andrieu and Vihola 
(2015) propose an alternative validation via auxiliary weights used in the un¬ 
biased estimation, assuming the unbiased estimator (or the weight) is generated 
conditional on the proposed value in the original Markov chain. The performances 
of pseudo-marginal solutions depend on the quality of the estimators i r and are 
always poorer than when using the exact target ir. In particular, improvements 
can be found by using multiple samples of z to estimate 7r (Andrieu and Vihola, 
2015). 


6. CONCLUSION AND NEW DIRECTIONS 

The Metropolis-Hastings algorithm is to be understood as a default or off-the- 
shelf solution, meaning that (a) it rarely achieves optimal rates of convergence 
(Mengersen and Tweedie, 1996) and may get into convergence difficulties if im¬ 
properly calibrated but (b) it can be combined with other solutions as a baseline 
solution, offering further local or more rarely global exploration to a taylored 
algorithm. Provided reversibility is preserved, it is indeed valid to mix several 
MCMC algorithms together, for instance picking one of the kernels at random or 
following a cycle (Tierney, 1994; Robert and Casella, 2004). Unless a proposal is 
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relatively expensive to compute or to implement, it rarely hurts to add an extra 
kernel into the MCMC machinery. 

This is not to state that the Metropolis-Hastings algorithm is the ultimate 
solution to all simulation and stochastic evaluation problems. For one thing, there 
exist settings where the intractability of the target is such that no practical 
MCMC solution is available. For another thing, there exist non reversible versions 
of those algorithms, like Hamiltonian (or hybrid) Monte Carlo (HMC) (Duane 
et al., 1987; Neal, 2013; Betancourt et al., 2014). This method starts by creating 
a completly artificial variable p, inspired by the momentum in physics, and a 
joint distribution on (q,p) which energy—minus the log-density—is defined by 
the Hamiltonian 

H{q, p) = - log 7T (q) + p T M~ 1 p/2 , 

where M is the so-called mass matrix. The second part in the target is called the 
kinetic energy, still by analogy with mechanics. When the joint vector (q,p) is 
driven by Hamilton’s equations 

d q 
At 
dp 
At 


dH dH 


= - jp- = -rr- = M 1 P 
op op 


dH 

dq 


d log 7T 
dq 


this dynamics preserves the joint distribution with density exp —H(p,q). If we 
could simulate exactly from this joint distribution of (q,p), a sample from ir(q) 
would be a by-product. In practice, the equation is only solved approximately and 
hence requires a Metropolis-Hastings correction. Its practical implementation is 
called the leapfrog approximation (Neal, 2013; Girolami and Calderhead, 2011) as 
it relies on a small discretisation step e, updating p and q via a modified Euler’s 
method called the leapfrog that is reversible and preserves volume as well. This 
discretised update can be repeated for an arbitrary number of steps. 

The appeal of HMC against other MCMC solutions is that the value of the 
Hamiltonian changes very little during the Metropolis step, while possibly pro¬ 
ducing a very different value of q. Intuitively, moving along level sets in the 
augmented space is almost energy-free, but if those move proceeds far enough, 
the Markov chain on q can reach distant regions, thus avoid the typical local 
nature of regular MCMC algorithms. This strengthe explains in part why a sta¬ 
tistical software like STAN (Stan Development Team, 2014) is mostly based on 
HMC moves. 

As a last direction for new MCMC solutions, let us point out the require¬ 
ments set by Big Data, i.e., in settings where the likelihood function cannot be 
cheaply evaluated for the entire dataset. See, e.g., Scott et al. (2013); Wang and 
Dunson (2013), for recent entries on different parallel ways of handling massive 
datasets, and Brockwell (2006); Strid (2010); Banterle et al. (2015) for delayed 
and prefetching MCMC techniques that avoid considering the entire likelihood 
at once. 
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